Add notebook for "Evaluating AI Search Engines with the judges Library"
#270
base: main
Conversation
Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks.
I wish you hadn't made this into a new PR; it's harder to track our comments and follow the changes. Can you re-open your former PR and commit the files there instead, so we can see the changes clearly?
👋 @merveenoyan sorry about that! All of the commits from that PR are the same in this one except the most recent one. James won't be able to finish up that PR for us, so I needed to make a new one to ensure it gets the attention it needs. Please let me know how else I can help make this smoother. I'm happy to copy over the comments from the previous PR as well if that helps! Otherwise, I think the only other option would be to open a PR *on top* of the other one, but you would need to merge as a repo owner since that PR was made by James and not me.
I just left some nits, otherwise looks good! @stevhliu should review too
It may be easier to consume this content in table form:
| Judge | What | Why | Source | When to use |
|---|---|---|---|---|
| PollMultihopCorrectness | | | | |
| PrometheusAbsoluteCoarseCorrectness | | | | |
| MTBenchChatBotResponseQuality | | | | |
Thanks, just a few more comments and then we can merge! 🤗
@@ -12,6 +12,7 @@ Check out the recently added notebooks:
- [Fine-tuning SmolVLM using direct preference optimization (DPO) with TRL on a consumer GPU](fine_tuning_vlm_dpo_smolvlm_instruct)
- [Smol Multimodal RAG: Building with ColSmolVLM and SmolVLM on Colab's Free-Tier GPU](multimodal_rag_using_document_retrieval_and_smol_vlm)
- [Fine-tuning SmolVLM with TRL on a consumer GPU](fine_tuning_smol_vlm_sft_trl)
+ - [Evaluating AI Search Engines with `judges` - the open-source library for LLM-as-a-judge evaluators](llm_judge_evaluating_ai_search_engines_with_judges_library)
I'd put this notebook at the top of the list since it's the most recent, and then remove "Fine-tuning SmolVLM with TRL on a consumer GPU" to keep the list tidy
Description
This notebook showcases how to use judges—an open-source library for LLM-as-a-Judge evaluators—to assess and compare outputs from AI search engines like Gemini, Perplexity, and EXA.
This PR is a continuation of #257 -- shepherding the PR across!
What is judges?
judges is an open-source library that provides research-backed, ready-to-use LLM-based evaluators for assessing outputs across various dimensions such as correctness, quality, and harmfulness. It supports both classifier judges, which return boolean or categorical verdicts, and grader judges, which return numerical scores.
The library also provides an integration with litellm, allowing access to most open- and closed-source models and providers.
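To give a feel for the LLM-as-a-judge pattern the library implements, here is a minimal, self-contained sketch. The class and method names below are illustrative assumptions, not the actual `judges` API, and the model call is stubbed out so the example runs without any provider; in real use the call would go through litellm to a model such as LLaMA 3.

```python
# Minimal LLM-as-a-judge sketch. The real `judges` library ships
# research-backed prompts and routes model calls through litellm;
# the names here (CorrectnessJudge, Judgment) are illustrative only.
from dataclasses import dataclass


@dataclass
class Judgment:
    score: bool      # classifier-style verdict
    reasoning: str   # the judge model's explanation


class CorrectnessJudge:
    PROMPT = (
        "You are a strict grader. Given a question, an expected answer, "
        "and a model answer, reply 'True' if the model answer is correct, "
        "otherwise 'False', then explain briefly on the next line.\n"
        "Question: {input}\nExpected: {expected}\nAnswer: {output}"
    )

    def __init__(self, model_call):
        # model_call: str -> str, e.g. a wrapper around litellm.completion
        self.model_call = model_call

    def judge(self, input, output, expected):
        reply = self.model_call(
            self.PROMPT.format(input=input, output=output, expected=expected)
        )
        verdict, _, reasoning = reply.partition("\n")
        return Judgment(score=verdict.strip() == "True",
                        reasoning=reasoning.strip())


def stub_model(prompt):
    # Stand-in for an LLM: a naive substring heuristic over the prompt
    expected = prompt.split("Expected: ")[1].split("\n")[0]
    answer = prompt.split("Answer: ")[1]
    ok = expected.lower() in answer.lower()
    return ("True" if ok else "False") + "\nSubstring match heuristic."


judge = CorrectnessJudge(stub_model)
result = judge.judge(
    input="What is the capital of France?",
    output="The capital of France is Paris.",
    expected="Paris",
)
print(result.score)  # True
```

The key design point the sketch mirrors: the judge is just a prompt template plus a pluggable model callable, which is what makes it easy to swap in any litellm-supported provider.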
What This Notebook Does
Open-Source Tools & Resources
Why This Notebook?
This notebook provides a practical example of using judges with an open-source model (LLaMA 3) to evaluate real-world AI outputs. It highlights the library's flexibility, ease of integration with litellm, and usefulness for benchmarking AI systems in a transparent, reproducible manner.